Imagine a traditional classroom: we have quizzes and exams. Quizzes test what you have just learned, and exams validate the knowledge gained in the course as a whole. However, the ultimate goal of a course is to train a student to apply those concepts elsewhere in their career, not just to score well on quizzes and exams. Similarly, we do not train a neural network to get amazing performance on the training set; we want the network to perform well on entirely new data. Compared to its performance on the training data, a network usually does not perform as well on new data.
I want to start the discussion with Occam's Razor, which suggests choosing the simplest model that works. Choosing a simple model is difficult for a neural network because it is inherently complex. A neural network learns the distribution of the data during training so that it can work on new data drawn from that distribution (test accuracy). There is usually a slight performance drop between training time and test time, and this drop is called the generalization error. In some cases, when training is too aggressive, the network starts memorizing and fitting the training data instead of learning the data distribution, which results in poor performance at test time. This is a sign of poor generalizability. The techniques we use to improve the generalization of a network are called regularization. A few regularization techniques are commonly used in neural networks.
Regularization helps us arrive at a simpler final model even with a complex architecture. One classic type of regularization is weight penalties, which keep the values of the weight vectors in check. To achieve this, we add a norm of the weight vector to the error function to get the final cost function. We can use any norm from $L^1$ to $L^\infty$; the most widely used are $L^2$ and $L^1$.
$L^2$ regularization is also called Ridge Regression or Tikhonov regularization. Among the weight penalties, $L^2$ is the most widely used. $L^2$ regularization penalizes large weights. We achieve it by adding the square of the $L^2$ norm of the weight vector to the cost function. The mathematical representation of $L^2$ regularization is: $$Cost = E(X) + \lambda \parallel W \parallel_2 ^ 2$$ The new gradient $g$ of the cost function with respect to the weights $W$ is: $$g = \frac{\partial E(X)}{\partial W} + 2 \lambda W$$
$\lambda$ is the regularization coefficient that can be used to control the level of regularization.
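To make the update rule concrete, here is a minimal NumPy sketch, independent of YANN; the array grad_E stands in for $\partial E(X)/\partial W$ (normally obtained by backpropagation), and the values are made up purely for illustration:

import numpy as np

# Hypothetical weight matrix and data gradient from backpropagation.
W = np.array([[ 0.5, -1.2],
              [ 2.0,  0.1]])
grad_E = np.array([[ 0.01, -0.03],
                   [ 0.02,  0.00]])

lam = 0.001                           # regularization coefficient (lambda)

l2_cost_term = lam * np.sum(W ** 2)   # lambda * ||W||_2^2, added to the cost
g = grad_E + 2 * lam * W              # gradient picks up the 2 * lambda * W term

Notice that the penalty term grows with the magnitude of each weight, so large weights are pulled toward zero more strongly than small ones.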
In $L^1$ regularization we add the first norm of the weight vector to the cost function. $L^1$ regularization penalizes the weights that are not zero and forces them toward zero, as a result of which the final parameters are sparse, with most of the weights being zero. The mathematical representation of $L^1$ regularization is: $$Cost = E(X) + \lambda \parallel W \parallel_1$$ The new gradient $g$ of the cost function with respect to the weights $W$ is: $$g = \frac{\partial E(X)}{\partial W} + \lambda \, \mathrm{sign}(W)$$
We do not have to restrict ourselves to a single weight-norm penalty for a parameter; we can combine more than one weight penalty, and the final model will reflect the properties of all the regularizers. For example, if we use both the $L^1$ and $L^2$ weight penalties in our model, the cost function becomes $$Cost = E(X) + \lambda_2 \parallel W \parallel_2 ^ 2 + \lambda_1 \parallel W \parallel_1$$ The new gradient $g$ of the cost function with respect to the weight vector $W$ is: $$g = \frac{\partial E(X)}{\partial W} + 2 \lambda_2 W + \lambda_1 \, \mathrm{sign}(W)$$
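The $L^1$ and combined penalties modify the gradient in the same additive way; continuing the NumPy sketch from above (again independent of YANN, with made-up values):

import numpy as np

W = np.array([[ 0.5, -1.2],
              [ 2.0,  0.1]])
grad_E = np.array([[ 0.01, -0.03],
                   [ 0.02,  0.00]])

lam_1, lam_2 = 0.001, 0.001           # L1 and L2 regularization coefficients

# L1 alone: the penalty contributes lambda_1 * sign(W), pushing weights toward zero.
g_l1 = grad_E + lam_1 * np.sign(W)

# L1 and L2 combined: both penalty gradients are simply added to the data gradient.
g_combined = grad_E + 2 * lam_2 * W + lam_1 * np.sign(W)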
YANN has the flexibility to regularize a selected layer or the entire network. To regularize a layer, set the following arguments in the network.add_layer() function:
regularize – True if you want to apply regularization, False if not.
regularizer – (l1_coeff, l2_coeff), the coefficients for the L1 and L2 regularizers. Default is (0.001, 0.001).
To apply common regularization parameters to the entire network, we can pass the regularization argument in the optimizer parameters:
"regularization" : (l1_coeff, l2_coeff). Default is (0.001, 0.001)
Let's see Regularization in action:
In [1]:
from yann.network import network
from yann.utils.graph import draw_network
from yann.special.datasets import cook_mnist
def lenet5 ( dataset= None, verbose = 1, regularization = None ):
"""
This function is a demo example of LeNet-5 from the famous paper by Yann LeCun.
This is example code. You should study this code rather than merely run it.
Warning:
This is not the exact implementation but a modern re-incarnation.
Args:
dataset: Supply a dataset.
verbose: Verbosity level, similar to the rest of the toolbox.
"""
optimizer_params = {
"momentum_type" : 'nesterov',
"momentum_params" : (0.65, 0.97, 30),
"optimizer_type" : 'rmsprop',
"id" : "main"
}
dataset_params = {
"dataset" : dataset,
"svm" : False,
"n_classes" : 10,
"id" : 'data'
}
visualizer_params = {
"root" : 'lenet5',
"frequency" : 1,
"sample_size": 144,
"rgb_filters": True,
"debug_functions" : False,
"debug_layers": False, # Since we are on steroids this time, print everything.
"id" : 'main'
}
# intitialize the network
net = network( borrow = True,
verbose = verbose )
# or you can add modules after you create the net.
net.add_module ( type = 'optimizer',
params = optimizer_params,
verbose = verbose )
net.add_module ( type = 'datastream',
params = dataset_params,
verbose = verbose )
net.add_module ( type = 'visualizer',
params = visualizer_params,
verbose = verbose
)
# add an input layer
net.add_layer ( type = "input",
id = "input",
verbose = verbose,
datastream_origin = 'data', # if you didn't add a dataset module, now is
# the time.
mean_subtract = False )
# add first convolutional layer
net.add_layer ( type = "conv_pool",
origin = "input",
id = "conv_pool_1",
num_neurons = 20,
filter_size = (5,5),
pool_size = (2,2),
activation = 'maxout(2,2)',
# regularize = True,          # uncomment to apply the layer-wise regularizer here
verbose = verbose
)
net.add_layer ( type = "conv_pool",
origin = "conv_pool_1",
id = "conv_pool_2",
num_neurons = 50,
filter_size = (3,3),
pool_size = (2,2),
activation = 'relu',
# regularize = True,
verbose = verbose
)
net.add_layer ( type = "dot_product",
origin = "conv_pool_2",
id = "dot_product_1",
num_neurons = 1250,
activation = 'relu',
# regularize = True,
verbose = verbose
)
net.add_layer ( type = "dot_product",
origin = "dot_product_1",
id = "dot_product_2",
num_neurons = 1250,
activation = 'relu',
# regularize = True,
verbose = verbose
)
net.add_layer ( type = "classifier",
id = "softmax",
origin = "dot_product_2",
num_classes = 10,
# regularize = True,
activation = 'softmax',
verbose = verbose
)
net.add_layer ( type = "objective",
id = "obj",
origin = "softmax",
objective = "nll",
datastream_origin = 'data',
regularization = regularization,
verbose = verbose
)
learning_rates = (0.05, .0001, 0.001)
net.pretty_print()
# draw_network(net.graph, filename = 'lenet.png')
net.cook()
net.train( epochs = (20, 20),
validate_after_epochs = 1,
training_accuracy = True,
learning_rates = learning_rates,
show_progress = True,
early_terminate = True,
patience = 2,
verbose = verbose)
print(net.test(verbose = verbose))
data = cook_mnist()
dataset = data.dataset_location()
lenet5 ( dataset, verbose = 2)
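To actually experiment with regularization in this demo, uncomment the regularize = True lines on the layers you want to penalize and pass the coefficients through the function's regularization argument; the values below are just an illustrative choice:

lenet5 ( dataset, verbose = 2, regularization = (0.0001, 0.0001) )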